Collecting YouTube Data

Session 2c (1)

Author: Zixi Chen, PhD
Affiliation: NYU-Shanghai
Published: November 13, 2025

1 Scraping politeness

Make sure you know what you can scrape and what you cannot by reading the platform’s official documents. Most platforms have a robots.txt page at “platform.com/robots.txt”. Here is the one for YouTube: https://www.youtube.com/robots.txt. See Fig. 1 below.

Fig.1 robots.txt of YouTube

YouTube also has a very detailed official Data API document, where you can find the data you can scrape from this API in its Reference pages. We will take a look at it together later.

  1. If you plan to scrape a considerable amount of data using R, you should use the polite package to seek permission and pause between data requests. In this class, we only obtain a small amount of data, so we won’t use this package. But check out how to use polite here and here if you have a heavy data-scraping task; a minimal sketch follows below.
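For illustration, here is a minimal sketch of polite in action, assuming the package is installed: bow() reads the site’s robots.txt and registers a session, and printing the session reports whether the path is scrapable and what crawl delay applies.

#install.packages("polite")
library(polite)
session <- bow("https://www.youtube.com/") # consults robots.txt and sets a rate limit
session # reports whether this path is scrapable for your user agent

Any subsequent scrape(session) call then automatically respects the declared delay.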
ATTENTION

In this class, I will demonstrate the code for scraping. If many of us run that code at the same time, we are highly likely to get our access denied. To avoid this issue, I will provide the scraped data. You can download it, load it in your RStudio, and avoid running any scraping-related code (I will note which chunks are for scraping). With the provided scraped data, you can run the other, non-scraping code for practice alongside the class. You can try the scraping code after class and scrape the videos of your interest.

Download the scraped data here, and load it with the following code:

load("your directory/YT_scraped_Data.RData")

2 A simple example

Instead of writing our own API applications to access the YouTube data, we use two off-the-shelf R packages, tuber and vosonSML, which provide handy functions to quickly scrape video and user data.

Of course, since these two packages were written for their authors’ own purposes, they may not cover the YouTube data you want to use. In that case, you may want to write your own API calls. You can find an awesome example of writing API calls using R here. For teaching purposes, we will use these two handy R packages.

#install.packages(c("vosonSML", "tuber", "httpuv")) 

library(vosonSML) # we use it for collecting comments data
library(tuber) # we use it for collecting video and channel attributes. 

library(tidyverse) # we use it for data wrangling 
library(httpuv) # we use it for handling HTTP and WebSocket requests.

In the following, we use two YouTube videos to show a basic workflow for extracting video and channel information. These two videos were created by David Robinson and Julia Silge, data scientists and the authors of Text Mining with R. They are also R educators who have been active in the weekly R social-learning event TidyTuesday for years. Check it out if you want to practice and improve your R skills.

2.1 Gain access permission

In the last class, we obtained our authentication information, including an API key, a client ID, and a client secret. We also used the following chunk to store that authentication information in the .Renviron file. Double-check that your authentication information is stored correctly; this is a best practice for keeping your credentials secure. See more about the best practices of using APIs in R suggested by Chung-hong Chan.

#install.packages("usethis")
#library(usethis)
usethis::edit_r_environ(scope = "user")
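For reference, the entries in your .Renviron file should look something like the following (placeholder values); the names must match those used with Sys.getenv() in the chunks below. Restart R after editing so the values are picked up, and verify with, e.g., Sys.getenv("API_key") in the console.

API_key=your-api-key
Client_ID=your-client-id
Client_Secret=your-client-secret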

Now, we can close the .Renviron file and call for API permissions to gain data access. We then run the following chunks in an R script.

YT_Auth <- vosonSML::Authenticate("youtube", 
                                  apiKey = Sys.getenv("API_key"))
tuber::yt_oauth(app_id = Sys.getenv("Client_ID"),
                app_secret = Sys.getenv("Client_Secret"),
                token = '')

A message will then show in your console. Enter “1” in the console, which directs you to a webpage requesting access to your account. Select yes, and you should see a message saying that authentication has been granted.

Errors?

If you receive an error message saying your API request is abusive, copy your Project ID into the App Name field. This should solve the issue.

2.2 Video comments

2.2.1 Identification numbers

To begin with, we create a vector of the videos’ URLs to get the videos’ unique identification numbers (IDs). In a YouTube video URL, the video’s ID is the string of numbers and letters after =. When using vosonSML, we don’t need to manually input video IDs, since it has a built-in function to extract IDs from the URLs; you only need to create a vector of URLs.

videoIDs.url <- c("https://www.youtube.com/watch?v=d7uFpxtiXyk",
                  "https://www.youtube.com/watch?v=dbXDkEEuvCU")

What information about video comments can we retrieve from YouTube’s API? We can find out from the API’s documentation. The screenshot below was taken from the reference page for “Comments.” The terms on the left side of the colons, shown in green, such as “kind,” “etag,” “id,” and “snippet,” are the variables we can request from the API.

You can also see that more information is nested within some parts, such as “authorDisplayName” within “snippet.” This is the typical nested, or tree, structure of raw scraped data. In R, this kind of data is represented as lists and needs further wrangling.

No worries for now, since most of the functions we use today automatically unnest the list data, leaving us a nice dataframe to work with. An unnested variable, such as “authorDisplayName” in “snippet,” will be shown as “snippet.authorDisplayName.”
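A toy illustration of this naming convention (a hand-made list, not real API output):

x <- list(kind = "youtube#comment",
          snippet = list(authorDisplayName = "@someone", likeCount = "3"))
as.data.frame(x) # columns: kind, snippet.authorDisplayName, snippet.likeCount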

Fig 2. API resource representation of YouTube comments

You can further find the codebook explaining these variables in the following “Properties” table on that reference page. For example, “id” is a string variable, indicating “the ID that YouTube uses to uniquely identify the comment.”

Fig 3. API resource properties of YouTube comments
# code for scraping; hold your horses. 
comments.df <- YT_Auth %>% # this is the credential we created with the API
  Collect(videoIDs.url,
          maxComments = 500, # for teaching purposes, I set it to 500
          writeToFile = TRUE # if TRUE, this collection of comment data is saved to your working directory as an ".rds" file
          )
Tip

You can use the readRDS function to read in the “.rds” file. For example:

comments.df <- readRDS("your file directory/2024-02-27_061830-YoutubeData.rds")

Previously, we used load() to read in an .RData file. Both .RData and .rds files store R objects for later use. We use .RData to save one or multiple objects in one data file, whereas .rds saves a single object. For example:

save(a,b,c, file="a+b+c.Rdata")

saveRDS(a, "a.rds")
saveRDS(b, "b.rds")
saveRDS(c, "c.rds")
names(comments.df)
 [1] "Comment"               "AuthorDisplayName"     "AuthorProfileImageUrl"
 [4] "AuthorChannelUrl"      "AuthorChannelID"       "ReplyCount"           
 [7] "LikeCount"             "PublishedAt"           "UpdatedAt"            
[10] "CommentID"             "ParentID"              "VideoID"              
glimpse(comments.df)
Rows: 88
Columns: 12
$ Comment               <chr> "this is great as usual, I pray that you find mo…
$ AuthorDisplayName     <chr> "@birasafabrice", "@s2737474", "@simonramirezgar…
$ AuthorProfileImageUrl <chr> "https://yt3.ggpht.com/-p9B3cJqcuuEUgyt2w1WMV2gB…
$ AuthorChannelUrl      <chr> "http://www.youtube.com/@birasafabrice", "http:/…
$ AuthorChannelID       <chr> "UCkaDk6yajihpk7gOZVATkag", "UCVjbje1bA_A5mxzQe_…
$ ReplyCount            <chr> "0", "0", "0", "0", "2", "0", "2", "2", "0", "4"…
$ LikeCount             <chr> "3", "1", "1", "1", "0", "1", "0", "1", "0", "0"…
$ PublishedAt           <chr> "2022-12-10T19:34:54Z", "2022-12-08T19:24:05Z", …
$ UpdatedAt             <chr> "2022-12-10T19:34:54Z", "2022-12-08T19:24:05Z", …
$ CommentID             <chr> "UgwuOobz_bK-Dt2NpaF4AaABAg", "UgxgUyke6SH_bqIGG…
$ ParentID              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ VideoID               <chr> "d7uFpxtiXyk", "d7uFpxtiXyk", "d7uFpxtiXyk", "d7…

The comments.df data set is very network-y, showing the interactions of YouTube users around a video. We will use this YouTube comment data in the forthcoming R sessions on social network analysis (and text-as-data analysis). In today’s work, we will stick with the scraping task.

2.3 Find a list of videos through keywords or hashtags

Similarly, if you want to study a YouTube comment network for videos sharing a topic, you may first collect a list of videos through keywords or hashtags. In our example, we search for videos with “TidyTuesday” in their descriptions using yt_search. A similar function is yt_topic_search.

# code for scraping; hold your horses. 
rtuesday0 <- tuber::yt_search("TidyTuesday"
                    # published_after = "2016-02-01T00:00:00Z" # you can specify the time window
                    )
rtuesday1 <- yt_search("TidyTuesday -declutter|house|housecleaning|room|bedroom|bathroom") # remove the "tidy" videos that are about decluttering or house cleaning

rtuesday2 <- yt_search("TidyTuesday, data") # TidyTuesday & data

We use the rtuesday1 data since it excludes the videos about house cleaning.

glimpse(rtuesday1)
Rows: 498
Columns: 17
$ video_id                  <chr> "6VAqqGlvT3o", "CRlbkBKI5iU", "vCBeGLpvoYM",…
$ publishedAt               <chr> "2023-03-21T23:30:06Z", "2022-11-16T02:07:39…
$ channelId                 <chr> "UCMGDKvc8-06jmxRrhYLr1_g", "UCeiiqmVK07qhY-…
$ title                     <chr> "Tidy Tuesday analysis: Ravelry yarn", "Tidy…
$ description               <chr> "Today I'm taking on the #tidytuesday challe…
$ thumbnails.default.url    <chr> "https://i.ytimg.com/vi/6VAqqGlvT3o/default.…
$ thumbnails.default.width  <chr> "120", "120", "120", "120", "120", "120", "1…
$ thumbnails.default.height <chr> "90", "90", "90", "90", "90", "90", "90", "9…
$ thumbnails.medium.url     <chr> "https://i.ytimg.com/vi/6VAqqGlvT3o/mqdefaul…
$ thumbnails.medium.width   <chr> "320", "320", "320", "320", "320", "320", "3…
$ thumbnails.medium.height  <chr> "180", "180", "180", "180", "180", "180", "1…
$ thumbnails.high.url       <chr> "https://i.ytimg.com/vi/6VAqqGlvT3o/hqdefaul…
$ thumbnails.high.width     <chr> "480", "480", "480", "480", "480", "480", "4…
$ thumbnails.high.height    <chr> "360", "360", "360", "360", "360", "360", "3…
$ channelTitle              <chr> "Equitable Equations", "David Robinson", "Da…
$ liveBroadcastContent      <chr> "none", "none", "none", "none", "none", "non…
$ publishTime               <chr> "2023-03-21T23:30:06Z", "2022-11-16T02:07:39…

In a real project, defining keywords depends on your research topic. If the keywords are generic, such as “politics,” you will get a huge list of unwanted videos. In our case, keywords and hashtags turn out to be similar, since the term “tidytuesday” is specific. See the list of videos for “#tidytuesday,” which has 470 videos from 72 channels. These statistics are close to our rtuesday1 data. 😆✌️

Fig. 4 “#tidytuesday” videos on YouTube

2.4 Channels

Continuing with the rtuesday1 data, let’s explore which channel has the most TidyTuesday videos.

rtuesday_channel<-rtuesday1 %>% 
  group_by(channelId) %>% 
  summarise(channel_counts=n()) %>% # the counts of TidyTuesday videos each channel has
  arrange(desc(channel_counts)) # order from the highest to the lowest

rtuesday_channel
# A tibble: 93 × 2
   channelId                channel_counts
   <chr>                             <int>
 1 UCOOZl_qCfHlXutEIbCorIjg             63
 2 UCeiiqmVK07qhY-wvg3IZiZQ             59
 3 UCUQ_k07_HMUPySlSC26knZg             48
 4 UCnwYO3Sz_emBTC1sTZ6TlsQ             39
 5 UCEYlsdkxG4hY6c5-5d3gxCw             24
 6 UCGW4bI5mAwAJvw5tBngFOWw             23
 7 UCnUfV6iRhtJ3CkqCaqtQxgw             20
 8 UCwVLSVjjZgR5U3a9SuPZKzw             14
 9 UCQ8ffPlas6SXN0MbSsKBb9A             13
10 UCDi8jjo62_cUbnlkAPcA0bA              9
# ℹ 83 more rows

Let’s see the top 5 channels. First, we extract their channel IDs.

top5_rtuesday_channel<-rtuesday_channel%>% 
  slice_head(n=5)

top5_rtuesday_channel
# A tibble: 5 × 2
  channelId                channel_counts
  <chr>                             <int>
1 UCOOZl_qCfHlXutEIbCorIjg             63
2 UCeiiqmVK07qhY-wvg3IZiZQ             59
3 UCUQ_k07_HMUPySlSC26knZg             48
4 UCnwYO3Sz_emBTC1sTZ6TlsQ             39
5 UCEYlsdkxG4hY6c5-5d3gxCw             24

Then, we can get the channel statistics using get_channel_stats. Since we want to extract the statistics of all five channels, we write a function and apply it to each channel. Again, check out the variables we can obtain from the channel reference page in the API documentation.

# code for scraping; hold your horses. 

top5_rtuesday_channel_id<- top5_rtuesday_channel$channelId

get_all_channel_stats <- function(channelId) {
  tuber::get_channel_stats(channel_id = channelId) %>% # returns a list
    as.data.frame() %>% # convert list to dataframe
    rename(channelId=id) 
} 

attr_Top5_rtuesday_channel<-map_df(.x = top5_rtuesday_channel_id, 
                                   .f = get_all_channel_stats)
glimpse(attr_Top5_rtuesday_channel)
Rows: 5
Columns: 23
$ kind                              <chr> "youtube#channel", "youtube#channel"…
$ etag                              <chr> "ZDS1sRoAFnK3kxwoanOXKX1RudY", "kW8n…
$ channelId                         <chr> "UCOOZl_qCfHlXutEIbCorIjg", "UCeiiqm…
$ snippet.title                     <chr> "Nicola Rennie", "David Robinson", "…
$ snippet.description               <chr> "", "", "Talking about my minimalism…
$ snippet.customUrl                 <chr> "@nrennie", "@safe4democracy", "@min…
$ snippet.publishedAt               <chr> "2022-09-17T13:07:39.065885Z", "2007…
$ snippet.thumbnails.default.url    <chr> "https://yt3.ggpht.com/vpt1CYRNuA2d_…
$ snippet.thumbnails.default.width  <int> 88, 88, 88, 88, 88
$ snippet.thumbnails.default.height <int> 88, 88, 88, 88, 88
$ snippet.thumbnails.medium.url     <chr> "https://yt3.ggpht.com/vpt1CYRNuA2d_…
$ snippet.thumbnails.medium.width   <int> 240, 240, 240, 240, 240
$ snippet.thumbnails.medium.height  <int> 240, 240, 240, 240, 240
$ snippet.thumbnails.high.url       <chr> "https://yt3.ggpht.com/vpt1CYRNuA2d_…
$ snippet.thumbnails.high.width     <int> 800, 800, 800, 800, 800
$ snippet.thumbnails.high.height    <int> 800, 800, 800, 800, 800
$ snippet.localized.title           <chr> "Nicola Rennie", "David Robinson", "…
$ snippet.localized.description     <chr> "", "", "Talking about my minimalism…
$ snippet.country                   <chr> "GB", NA, "US", "US", NA
$ statistics.viewCount              <chr> "2947", "474694", "13825", "236504",…
$ statistics.subscriberCount        <chr> "90", "15300", "145", "5150", "1730"
$ statistics.hiddenSubscriberCount  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE
$ statistics.videoCount             <chr> "70", "118", "356", "69", "276"
Note

You may notice that the get_channel_stats function only fetches variables from the “kind,” “etag,” “id,” “snippet,” and “statistics” parts. This is a downside of using off-the-shelf packages. To obtain additional information, you may want to modify the function by working on its source code or build your own API app.
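For instance, here is a minimal sketch of a raw call to the API’s channels endpoint with httr, requesting a part (“brandingSettings”) that get_channel_stats does not fetch. It reuses the API key stored in Section 2.1; treat it as a starting point rather than a finished helper.

# code for scraping; hold your horses. 
library(httr)
resp <- GET("https://www.googleapis.com/youtube/v3/channels",
            query = list(part = "brandingSettings",
                         id = "UCeiiqmVK07qhY-wvg3IZiZQ", # David's channel ID
                         key = Sys.getenv("API_key")))
branding <- content(resp, as = "parsed") # a nested list to wrangle yourself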

2.5 Video statistics

The earlier comments.df presents statistics at the comment level. You may also want video-level information, such as how many views and likes a video has. In this case, we can use the get_all_channel_video_stats function from the tuber package to obtain statistics on all the videos in a user’s channel.

Another function, get_stats, does a similar task of extracting video statistics. However, it requires a list of video IDs and does not return the video creation time, so I prefer get_all_channel_video_stats.

# code for scraping; hold your horses. 
nren_all_channel_video_stats<-
tuber::get_all_channel_video_stats(channel_id = top5_rtuesday_channel_id[1])

drob_all_channel_video_stats<-
tuber::get_all_channel_video_stats(channel_id = top5_rtuesday_channel_id[2])
glimpse(nren_all_channel_video_stats)
Rows: 70
Columns: 11
$ id               <chr> "-ZnJa0hodfU", "_Vnq_xeXeGU", "03rQ9lZlnwg", "0dDOACN…
$ title            <chr> "#TidyTuesday 2023 Week 22 - Verified Oldest People",…
$ publication_date <chr> "2023-05-29T13:08:39Z", "2023-11-26T13:19:48Z", "2023…
$ description      <chr> "The making of my #TidyTuesday visualisation in R for…
$ channel_id       <chr> "UCOOZl_qCfHlXutEIbCorIjg", "UCOOZl_qCfHlXutEIbCorIjg…
$ channel_title    <chr> "Nicola Rennie", "Nicola Rennie", "Nicola Rennie", "N…
$ viewCount        <chr> "55", "10", "32", "41", "39", "55", "21", "17", "63",…
$ likeCount        <chr> "2", "1", "0", "1", "1", "3", "1", "1", "0", "0", "0"…
$ favoriteCount    <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"…
$ commentCount     <chr> "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0"…
$ url              <chr> "https://www.youtube.com/watch?v=-ZnJa0hodfU", "https…
glimpse(drob_all_channel_video_stats)
Rows: 120
Columns: 11
$ id               <chr> "-1x8Kpyndss", "-5HYdBq_PTM", "-RD8GNCNsCk", "-W-Oopv…
$ title            <chr> "Tidy Tuesday live screencast: Analyzing coffee ratin…
$ publication_date <chr> "2020-07-08T10:06:21Z", "2020-03-18T14:21:49Z", "2020…
$ description      <chr> "I'll analyze a dataset on coffee ratings in R, witho…
$ channel_id       <chr> "UCeiiqmVK07qhY-wvg3IZiZQ", "UCeiiqmVK07qhY-wvg3IZiZQ…
$ channel_title    <chr> "David Robinson", "David Robinson", "David Robinson",…
$ viewCount        <chr> "5166", "10469", "2089", "2682", "5145", "3559", "312…
$ likeCount        <chr> "144", "234", "56", "83", "161", "114", "69", "115", …
$ favoriteCount    <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"…
$ commentCount     <chr> "9", "15", "4", "4", "13", "6", "5", "10", "20", "10"…
$ url              <chr> "https://www.youtube.com/watch?v=-1x8Kpyndss", "https…
Tip

The code above takes one channel’s video statistics at a time. That is fine for two or five channels, but what if we have many more? In that case, consider writing functions with the apply family, purrr’s map functions, or loops to handle such repeated coding tasks; see the sketch below.
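For example, a minimal sketch that collects the video statistics of all top-5 channels in one go, reusing the map_df pattern from Section 2.4:

# code for scraping; hold your horses. 
top5_all_video_stats <- purrr::map_df(
  top5_rtuesday_channel_id,
  ~ tuber::get_all_channel_video_stats(channel_id = .x))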

Activity

Can you find the information on the corresponding variables from the API documentation?

2.6 Collecting a playlist

If your study is on a playlist of videos, you first need to get each video’s ID to obtain its comments data using the code above.

Manually copying and pasting each video URL is tedious and error-prone. Instead, we can automate the extraction of video IDs from a playlist using tuber and stringr. The following chunks show this automation using David’s TidyTuesday playlist.

drob_playlist_id <- stringr::str_split(
    string = "https://www.youtube.com/playlist?list=PL19ev-r1GBwkuyiwnxoHTRC8TTqP8OEi8", # the TidyTuesday playlist of David
    pattern = "=", 
    n = 2,
    simplify = TRUE)[ , 2]
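The result is the playlist’s ID, i.e., everything after the first = in the URL:

drob_playlist_id
[1] "PL19ev-r1GBwkuyiwnxoHTRC8TTqP8OEi8"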

We then use get_playlist_items to retrieve the playlist’s items and specify in which part, such as “snippet,” “status,” or “contentDetails,” the information we want to obtain. See the variables of playlists we can obtain from the YouTube API.

Fig 5. API resource representation of YouTube playlists

# code for scraping; hold your horses. 
drob_playlist_contentDetails <- tuber::get_playlist_items(
  filter = c(playlist_id = drob_playlist_id),
  part = "contentDetails", # specify the part
  max_results = 100, # set this at least as large as the playlist length; the API default is 50
  simplify = TRUE) # return a dataframe instead of a list

glimpse(drob_playlist_contentDetails)

The contentDetails.videoId variable contains the video IDs of the playlist.

drob_playlist_video_ids<-drob_playlist_contentDetails$contentDetails.videoId

head(drob_playlist_video_ids, 5)
[1] "5ub92c-5xFQ" "3PecUbnuYC4" "Sx9uo2tCOFM" "0q-qmNkhWyY" "2RadZrpzTaA"

How many videos are in this playlist? 81!

n_video_playlist<-length(drob_playlist_video_ids)
n_video_playlist
[1] 81

With this list of video IDs, we can further extract their comments, as we did earlier. Meanwhile, we can also retrieve each item’s snippet information, such as the video’s title, description, publication time, and more.
# code for scraping; hold your horses. 
drob_playlist_snippet <- tuber::get_playlist_items(
  filter = c(playlist_id = drob_playlist_id),
  part = "snippet",
  max_results = n_video_playlist, # set this to the number of videos
  simplify = FALSE) # returns a nested list; see the note below
Note

When I tried to get the snippet information of the videos with simplify = TRUE, an error message appeared about the failure to convert the list to a dataframe.

Therefore, I specified simplify = FALSE in the above chunk, which returns a complicated nested list. Converting a list of this complexity to a dataframe is beyond this class’s scope, so I will leave it here for now. You may want to give it a try; a sketch follows below.

# see the complexity of this list
#drob_playlist_snippet[["items"]][[1]][["snippet"]]$description
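If you do give it a try, here is a minimal, untested sketch of flattening the list with purrr and tibble, assuming each element of drob_playlist_snippet[["items"]] carries a "snippet" list like the one referenced above; the resourceId.videoId field is an assumption based on the API’s playlistItems reference.

drob_playlist_snippet_df <- purrr::map_df(
  drob_playlist_snippet[["items"]],
  function(item) {
    s <- item[["snippet"]]
    tibble::tibble(
      videoId     = s[["resourceId"]][["videoId"]], # assumed field; check the playlistItems reference
      title       = s[["title"]],
      description = s[["description"]],
      publishedAt = s[["publishedAt"]]
    )
  })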